Information Science G2.2 A genetic algorithm for the generation of equifrequently occurring groups of attributes
Abstract
The identification of groups of characteristics with approximately equal frequencies of occurrence is of importance in several areas of information science. This case study describes the use of a genetic algorithm (GA) for the identification of such groups. Experiments with several text dictionaries show that the GA is able to generate groups with a high degree of equifrequency; however, the results are inferior to those produced by an existing, deterministic algorithm if the characteristics are ordered in some way.

G2.2.1 Project overview

Statistical analyses of many types of bibliographic entity show that their frequencies of occurrence all follow a well-marked, near-hyperbolic distribution (Wyllys 1981, Zipf 1949). Examples of this behavior include the numbers of papers published by different authors, the numbers of citations to different papers, the lengths of posting lists in inverted-file retrieval systems, and the occurrences of characters and character substrings in natural language texts. Simple information theoretic considerations suggest that such distributions will limit the efficiency with which information can be stored and retrieved (Lynch 1977, Zunde 1981), and much work has thus been undertaken with the aim of producing sets of characteristics with equal, or at least less disparate, frequencies of occurrence. The resulting sets have been used for the generation of bitstrings for text signature searching, for the compression of natural language texts, for the sorting of dictionaries, and for the generation of monograph identifiers for document delivery systems, inter alia (see, e.g. Cooper et al 1980, Cooper and Lynch 1984, Goyal 1983, Schuegraf and Heaps 1973, Williams and Khallaghi 1977, Yannakoudakis and Wu 1982).
Concepts of equifrequency have also been used in the selection of access paths for numeric database management systems (Motzkin and Williams 1988) and they play a central role in the design of substructural indexing systems for databases of chemical molecules (Ash et al 1991).

The work described in this paper was carried out as part of a 2.5 person-year research project, funded by the British Library Research and Development Department, to evaluate the use of genetic algorithms (GAs; see section B1.2) for a range of problems in information retrieval. Three main applications were studied in this project: (i) the creation of nonhierarchic document classifications, (ii) the selection of optimal weights for the indexing of query terms in ranked-output retrieval systems, and (iii) the selection of equifrequent groups as discussed below. Full details of the work are presented by Robertson and Willett (1994, 1995) while other applications of GAs in information retrieval are described by Gordon (1988), Petry et al (1993; see section G2.1), and Yang et al (1993), inter alia.

© 1997 IOP Publishing Ltd and Oxford University Press. Handbook of Evolutionary Computation, release 97/1.

G2.2.2 Design process

Motivation for an evolutionary solution.

There are two obvious ways of dividing up a file of entities with associated frequencies of occurrence to produce sets of equifrequent groupings: in the first method the order of the original file is not preserved (assuming that it was originally ordered in some meaningful way), while the second method partitions a previously sorted input file (such as an alphabetically ordered dictionary), that is, the file is divided into groups while preserving the original order. These two approaches will be referred to subsequently as division and partition, respectively, and are illustrated by the following example.
Consider a set of seven objects with frequencies 5, 7, 4, 6, 3, 10, and 5. It is possible to divide this set into four groups with perfect equifrequency, since the groups {5, 5}, {10}, {6, 4}, and {7, 3} all have frequencies summing to ten. But it is not possible to achieve perfect equifrequency in the present case if the frequencies are partitioned: for example, one possible set of groups, given the initial ordering above, is {5, 7}, {4, 6}, {3, 10}, and {5}, with the sums of the groups being 12, 10, 13, and 5, respectively. Thus, a divisive procedure that was able to test all of the possible partitionings for all of the possible orderings of the seven objects in this data set would be able to identify the ordering and the partition that optimize the equifrequency criterion. The partitioning procedure, conversely, can only generate possible partitions derived from the single ordering that is presented to it and is thus far less likely to be able to identify an equifrequent grouping of the frequencies.

The greater simplicity of partitioning means that several partitioning algorithms have been suggested for the identification of equifrequent groupings (see, e.g. Cooper et al 1980, Cringean et al 1990, Schuegraf and Heaps 1973). Division algorithms are far less common, and appear to have been studied in an information retrieval context only by Yannakoudakis and Wu (1982). Their algorithm involves an initial allocation of frequencies to groups followed by a heuristic procedure that searches through all possible moves of the individual frequencies from each group to all other groups to find those that most increase the equifrequency of the grouping.
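The seven-object example can be checked by exhaustive search. The following is a minimal sketch, not the procedure of any of the cited authors: the function names are hypothetical, and the quality measure used here (largest group sum minus smallest group sum) is an assumption chosen only for illustration, not one of the fitness functions of this study.

```python
from itertools import combinations, product


def spread(sums):
    """Illustrative quality measure (an assumption): 0 means perfect equifrequency."""
    return max(sums) - min(sums)


def best_partition(freqs, n):
    """Try every way of cutting the ordered list into n contiguous, non-empty groups."""
    best = None
    for cuts in combinations(range(1, len(freqs)), n - 1):
        bounds = (0, *cuts, len(freqs))
        sums = [sum(freqs[a:b]) for a, b in zip(bounds, bounds[1:])]
        if best is None or spread(sums) < spread(best):
            best = sums
    return best


def best_division(freqs, n):
    """Try every assignment of items to n non-empty groups, ignoring the input order."""
    best = None
    for labels in product(range(n), repeat=len(freqs)):
        if len(set(labels)) < n:  # skip assignments that leave a group empty
            continue
        sums = [0] * n
        for f, g in zip(freqs, labels):
            sums[g] += f
        if best is None or spread(sums) < spread(best):
            best = sums
    return best


freqs = [5, 7, 4, 6, 3, 10, 5]
print("division :", best_division(freqs, 4))
print("partition:", best_partition(freqs, 4))
```

For these seven frequencies the division search reaches four groups that each sum to ten, whereas no contiguous partition does. The division search visits n^N assignments, which is why exhaustive division is impractical for anything beyond toy inputs.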
The procedure is extremely time consuming and can be used only when there are limited numbers of frequencies and groups: using frequency data from over 30 000 records in the British National Bibliography the experiments of Yannakoudakis and Wu involved dividing the 26 letters of the English alphabet into between 4 and 20 groups and dividing the 244 MARC record subfields into between 5 and 44 groups. The work reported here was carried out to determine whether the novel characteristics of the GA might enable the development of a divisive procedure that was able to process larger data sets than can be encompassed by conventional deterministic algorithms for this purpose.

General description of the type of EA used.

The work involved a GA, which was tested with a wide range of parametrizations.

Representation description.

The input to the program is a file of N frequencies, each of which denotes the number of times that a specific word in an N-element dictionary occurs in a database. The frequencies are read into an integer array of length N. An analogous N-element integer array is created to hold the number of the group (in the range [1, n] for a set of n groups) to which each of the frequencies has been assigned. The first n frequencies are assigned, one to each group, and each of the remaining frequencies is then assigned to the group with the smallest current total (thus ensuring that each group is assigned at least one value, for N ≥ n). The Ith element of these two arrays thus contains the occurrence frequency of the Ith word and the group to which that Ith word has been allocated. Once the initial chromosome has been created in this way, the other chromosomes in the first generation are created by random rearrangement of the second array (i.e. that giving the group membership of each frequency in the input data set). The genetic operators are then applied to the array of group numbers.

Fitness function.
Two measures of equifrequency were used as the fitness function. The first was the relative entropy (Lynch 1977, Yannakoudakis and Wu 1982). If there are to be n groups such that each group contains P(I) occurrences, then the total number of occurrences, total_freq, is given by

$$\mathit{total\_freq} = \sum_{I=1}^{n} P(I)$$
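The representation and the first fitness measure can be sketched together as follows. This is an illustrative sketch under stated assumptions rather than the authors' program: all function names are hypothetical, ties for the smallest group total are broken in favor of the lowest group number (the text does not say how ties are handled), and relative entropy is taken in its usual sense, the Shannon entropy of the group proportions P(I)/total_freq divided by its maximum value log n, because the text available here breaks off before giving the formula.

```python
import math
import random


def initial_chromosome(freqs, n):
    """Group number (1..n) for each frequency, seeded as the text describes."""
    if len(freqs) < n:
        raise ValueError("need at least as many frequencies as groups (N >= n)")
    groups = list(range(1, n + 1))  # first n frequencies: one per group
    totals = list(freqs[:n])        # running total of each group
    for f in freqs[n:]:
        g = totals.index(min(totals))  # smallest total; ties -> lowest group (assumption)
        totals[g] += f
        groups.append(g + 1)
    return groups


def initial_population(freqs, n, size, rng):
    """Seed chromosome plus size - 1 random rearrangements of its group array."""
    seed = initial_chromosome(freqs, n)
    population = [seed]
    for _ in range(size - 1):
        chrom = seed[:]
        rng.shuffle(chrom)  # permuting keeps every group number in use
        population.append(chrom)
    return population


def group_totals(freqs, chrom, n):
    """P(I): summed frequency of the words assigned to each group I."""
    totals = [0] * n
    for f, g in zip(freqs, chrom):
        totals[g - 1] += f
    return totals


def relative_entropy(totals):
    """Assumed definition: entropy of group proportions divided by log n (1.0 = equifrequent)."""
    n = len(totals)
    total_freq = sum(totals)
    if n < 2 or total_freq <= 0:
        raise ValueError("need at least two groups and a positive total")
    h = -sum((p / total_freq) * math.log(p / total_freq) for p in totals if p > 0)
    return h / math.log(n)


freqs = [5, 7, 4, 6, 3, 10, 5]
for chrom in initial_population(freqs, 4, size=6, rng=random.Random(0)):
    print(chrom, round(relative_entropy(group_totals(freqs, chrom, 4)), 4))
```

Under this assumed definition a perfectly equifrequent grouping scores 1.0 and any imbalance lowers the score, so a GA would maximize it. Because the other chromosomes are permutations of the seed, the number of words per group is fixed across the first generation; only the genetic operators can change it.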
Publication date: 1997